ABSTRACT

Clustering  of  individuals,  segmentation  of  time  series  and 
segmentation  of  numerical  images  can  all  be  considered  as  labeling 
problems, for each can be described in terms of pairs (x_t, g_t), t =
1, 2, ..., n, where x_t is the observation at instance t and g_t is
the unobservable "label" of instance t. The labels are to be
estimated,  along  with  any  unspecified  distributional  parameters.  In 
cluster  analysis  the  values  of  t  are  the  individuals  (cases)  observed 
and  the  x's  are  independent.  In  time  series  the  values  of  t  are  time 
instants  and  there  is  temporal  correlation.  In  numerical  image 
segmentation  the  values  of  t  denote  picture  elements  (pixels)  and 
spatial  correlation  between  neighboring  pixels  can  be  utilized.  The 
idea  in  segmentation  is  that  signals  and  time  series  often  are  not 
homogeneous  but  rather  are  generated  by  mechanisms  or  processes  with 
various  phases.  Similarly,  images  are  not  homogeneous  but  contain 
various  objects.  "Segmentation"  is  a  process  of  attempting  to  recover 
automatically  the  phases  or  objects.  The  present  report  summarizes  the 
work done on these problems under ARO Contract DAAG29-82-K-0155.

Key  words  and  phrases: 

statistical pattern recognition, classification;
temporal  correlation,  spatial  correlation; 
optimization by relaxation method.

1. Introduction

The  research  reported  here  relates  to  cluster  analysis  and  to 
statistical  processing  of  time  series  and  digitized  images.  This 
report  is  a  summary  of  work  performed  under  ARO  Contract 
DAAG29-82-K-0155 (6/15/82 - 6/15/85): Statistical Models and Methods
for Cluster Analysis and Image Segmentation. The types of datasets to
which the techniques developed are applicable include: signals such as
radar and sonar; economic and bio-medical time series; time series
arising from quality assurance acceptance sampling by attributes or
variables; and digital images which can result from various sources,
including bio-medical imagery, infrared imagery obtained by smart
munitions, and multispectral data obtained by satellite. The problems
addressed are those of clustering, and segmentation of time series and
images.

The  work  involves  the  further  development  of  algorithms  for 
clustering  large,  multidimensional  datasets  and  for  segmentation  of 
time  series  and  digital  images.  The  algorithms  are  based  on  maximum 
likelihood  estimation  in  distribution-mixture  models.  In  the  context 
of  these  mixture  models  clustering  is  construed  as  estimation  of 
unobserved  labels.  An  observation's  label,  were  it  observable,  would 
tell  from  which  mixture  component  the  observation  arose.  Image 
segmentation  is  also  considered  as  a  labeling  problem.  Throughout  the 
work there is an attempt to apply model-selection criteria to the
decision  as  to  an  appropriate  number  of  clusters  or  classes  of  segment. 

Software development is an important aspect of such a project.
The algorithms developed are programmed in FORTRAN.

Some of the ideas developed in the project have already been
published; see Sclove (1983a,b,c; 1984a) and Bozdogan and Sclove
(1984).

The  organization  of  the  present  paper  is  as  follows:  Section  2 
concerns  cluster  analysis;  in  this  section  there  is  some  general 
discussion of model-selection criteria and a digression to mention some
ideas  concerning  clustering  of  variables.  Section  3  summarizes  some  of 
the  results  on  time-series  segmentation,  and  results  on  image 
segmentation  are  discussed  in  Section  4. 

2. Cluster analysis

Background.  The  mixture  model  for  the  clustering  problem  treats 
the  sample  as  having  arisen  from  a  mixture  of  several  (k) 
distributions. This is the approach put forth in (Sclove 1977). The
research problem set there was, at least in part, to see whether the
ISODATA (Ball and Hall, 1967) and K-MEANS (MacQueen, 1967) algorithms
could be interpreted as mathematical-statistical estimation schemes in
some model for the clustering problem. That is, did there exist a
model for the clustering problem, and an estimation method in that
model, such that ISODATA and K-MEANS corresponded to that method
applied to that model? The answer, provided in (Sclove 1977), was
affirmative; this will be explained below, but first let us briefly
define ISODATA and K-MEANS.

The ISODATA scheme proceeds as follows. One starts with
tentative estimates of cluster means as seed points for the clusters
and assigns each observation to the mean to which it is closest. The
cluster means are then re-estimated, and one loops through the data
again, reassigning the observations, and so on. In the K-MEANS algorithm,
the  seed  points  are  updated  immediately  after  each  observation  is 
tentatively  classified.  In  (Sclove  1977)  it  was  shown  that  these 
algorithms  correspond  to  iterative  maximum  likelihood  estimation  in  a 
type  of  mixture  model  for  the  clustering  problem,  where  the  component 
distributions  are  multivariate  normal. 
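
As a minimal illustrative sketch (in Python, not the FORTRAN programs of
the project), the two updating schemes can be contrasted as follows. The
function names, the use of squared Euclidean distance, and the convention
that each seed point counts as one observation in the sequential update
are assumptions made here for illustration only.

```python
def nearest(x, means):
    """Index of the mean closest to x (squared Euclidean distance)."""
    return min(range(len(means)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, means[j])))


def batch_pass(data, means):
    """One ISODATA-style pass: assign every point, then re-estimate the means."""
    labels = [nearest(x, means) for x in data]
    new_means = []
    for j, old in enumerate(means):
        members = [x for x, g in zip(data, labels) if g == j]
        if members:
            new_means.append([sum(c) / len(members) for c in zip(*members)])
        else:
            new_means.append(list(old))  # keep the seed of an empty cluster
    return labels, new_means


def sequential_pass(data, means):
    """One K-MEANS-style pass: update a mean right after each assignment."""
    means = [list(m) for m in means]
    counts = [1] * len(means)  # assumption: each seed counts as one observation
    labels = []
    for x in data:
        j = nearest(x, means)
        counts[j] += 1
        means[j] = [m + (a - m) / counts[j] for m, a in zip(means[j], x)]
        labels.append(j)
    return labels, means


# Hypothetical two-dimensional data and seed points.
data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
seeds = [(0.0, 0.0), (5.0, 5.0)]
batch_labels, batch_means = batch_pass(data, seeds)
seq_labels, seq_means = sequential_pass(data, seeds)
```

In the batch pass the means change only after all observations have been
assigned; in the sequential pass each assignment moves the chosen mean
at once, so later observations are compared with updated seed points.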

This  clustering  can  be  done  for  various  values  of  k,  the  number  of 
clusters.  Figures  of  merit  can  be  used  to  choose  the  best  k. 
Model-selection criteria can be used as figures of merit.


2.1. Model-selection criteria

In the context of a mixture model, choice of the number of
clusters k can be viewed as a model-selection problem. However,
at least in the case of clustering individuals, existing
model-selection criteria have to be modified, as they depend upon
(regularity) assumptions that are not always met in mixture models
for clustering individuals.

In any case, let us review some of the existing model-selection
criteria. Consider, then, a problem of choosing from among several
models, indexed by k (k = 1, 2, ..., K). Let L(k) be the likelihood,
given the k-th model. Various model-selection criteria taking the form

-2 log(max L(k)) + a(n) m(k) + b(k),   (1)

have been developed in relatively recent years. Here n is the sample
size, log denotes the natural logarithm, max L(k) denotes the maximum
of the likelihood over the parameters, and m(k) is the number of
independent parameters in the k-th model. For a given criterion, a(n)
is the cost of fitting an additional parameter and b(k) is an
additional term depending upon the criterion and the model k. One
chooses the model k for which the value of the criterion being used is
smallest.

Akaike (see, e.g., Akaike 1973, 1974, 1981) developed such a
criterion as a (heuristic) estimate of the expected entropy
(Kullback-Leibler information). Akaike's information criterion (AIC)
is of the form (1) with

a(n) = 2 for all n,  b(k) = 0   (AIC).   (2)

Schwarz (1978), working from a Bayesian viewpoint, obtained a criterion
of the form (1) with

a(n) = log n,  b(k) = 0   (Schwarz' criterion).   (3)

Since log n exceeds 2 for n of 8 or more (e^2 is about 7.39), it
follows that Schwarz' criterion favors models with fewer parameters
than does Akaike's.
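
The form (1) with the choices (2) and (3) can be sketched in a few lines
of Python. The maximized log-likelihoods and parameter counts below are
hypothetical numbers chosen only to show the two criteria disagreeing;
the function names are likewise illustrative.

```python
import math


def criterion(max_log_lik, m, a_n, b_k=0.0):
    """Form (1): -2 log(max L(k)) + a(n) m(k) + b(k)."""
    return -2.0 * max_log_lik + a_n * m + b_k


def aic(max_log_lik, m):
    """Choice (2): a(n) = 2, b(k) = 0."""
    return criterion(max_log_lik, m, a_n=2.0)


def schwarz(max_log_lik, m, n):
    """Choice (3): a(n) = log n, b(k) = 0."""
    return criterion(max_log_lik, m, a_n=math.log(n))


def best_model(scores):
    """Return the model index k with the smallest criterion value."""
    return min(scores, key=scores.get)


# Hypothetical (max log-likelihood, number of parameters) for k = 1, 2, 3.
n = 100
fits = {1: (-250.0, 2), 2: (-245.0, 5), 3: (-244.0, 8)}
aic_scores = {k: aic(ll, m) for k, (ll, m) in fits.items()}
sch_scores = {k: schwarz(ll, m, n) for k, (ll, m) in fits.items()}
```

With these made-up values AIC selects k = 2, while Schwarz' criterion,
whose cost per parameter is log 100 (about 4.6), selects k = 1.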

Noting that AIC has a(n) a constant function of n, namely 2,
various researchers, including Kashyap (1982) and Schwarz (1978), have
mentioned that AIC is not consistent; a(n) needs to depend upon n.

Kashyap (1982), also working from a Bayesian approach, took the
asymptotic expansion of the logarithm of the posterior probabilities a
term further than did Schwarz and obtained the criterion of the form
(1) given by

a(n) = log n,  b(k) = log(det B(k))   (Kashyap's criterion),   (4)

where det denotes the determinant and B(k) is the negative of the
matrix of second partials of log L(k), evaluated at the maximum
likelihood estimates. In Gaussian linear models this is the inverse of
the covariance matrix of the maximum likelihood estimates of the
regression coefficients; in general, the expectation of B(k), evaluated
at the true parameter values, is Fisher's information matrix. Since
Kashyap's criterion is based on reasoning similar to Schwarz', but
contains an extra term, it may perform better. [Further comments on
model-selection criteria are made in Sclove (1983d).]
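
As a hedged illustration of (4), consider the simplest case of a single
univariate normal sample with parameters (mean, variance). There B(k) is
diagonal at the maximum likelihood estimates, B = diag(n/v, n/(2 v^2))
with v the estimated variance, so log det B is available in closed form.
The Python function below is a sketch under that assumed
parameterization; its names are illustrative and it is not taken from
the project's programs.

```python
import math


def normal_fit(xs):
    """Maximum likelihood estimates and maximized log-likelihood, one normal sample."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    max_log_lik = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)
    return n, mean, var, max_log_lik


def kashyap_normal(xs):
    """Form (1) with a(n) = log n and b(k) = log det B(k), for a single normal.

    Assumes the parameters are (mean, variance), so that at the estimates
    B = diag(n / var, n / (2 var^2)) and det B = n^2 / (2 var^3).
    """
    n, _, var, max_log_lik = normal_fit(xs)
    m = 2  # mean and variance
    log_det_b = 2.0 * math.log(n) - math.log(2.0) - 3.0 * math.log(var)
    return -2.0 * max_log_lik + math.log(n) * m + log_det_b


value = kashyap_normal([1.0, 2.0, 3.0, 4.0])  # hypothetical data
```

The value of det B depends on how the model is parameterized, so a
different choice (for example, standard deviation instead of variance)
would change b(k) by a term that does not depend on the data.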

2.2. Multi-sample clustering

The problem of multi-sample clustering, the grouping of samples,
is treated in Bozdogan and Sclove (1984). The situation is the
K-sample problem (one-way analysis of variance), with an emphasis on
grouping the samples into fewer than K clusters. The use of
model-selection criteria in this context can provide an alternative to
multiple-comparison procedures. Use of model-selection criteria avoids
the difficult choice of levels of significance in such problems.
Model-selection criteria can also be used in this context to decide
whether or not to assume a common covariance matrix. Kashyap's
criterion could be evaluated and used for these problems.
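
A minimal sketch of the idea, assuming univariate normal samples with
one mean per cluster of samples and a common variance, and using AIC as
the figure of merit: every grouping of the K samples is treated as a
separate model and the grouping with the smallest criterion value is
chosen. The data, the function names and the exhaustive enumeration
(practical only for small K) are illustrative assumptions.

```python
import math


def partitions(items):
    """All ways of grouping the items into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part


def aic_for_grouping(samples, grouping):
    """AIC for normal samples: one mean per cluster, common variance."""
    n = sum(len(s) for s in samples)
    rss = 0.0
    for cluster in grouping:
        pooled = [x for idx in cluster for x in samples[idx]]
        mean = sum(pooled) / len(pooled)
        rss += sum((x - mean) ** 2 for x in pooled)
    var = rss / n
    max_log_lik = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)
    m = len(grouping) + 1  # cluster means plus the common variance
    return -2.0 * max_log_lik + 2.0 * m


# Hypothetical K = 3 samples; the first two look alike.
samples = [[9.8, 10.1, 10.3, 9.9],
           [10.0, 10.2, 9.7, 10.1],
           [12.1, 11.8, 12.3, 12.0]]
best = min(partitions([0, 1, 2]), key=lambda g: aic_for_grouping(samples, g))
```

For these made-up data the selected grouping puts the first two samples
in one cluster and the third in a cluster of its own, without any choice
of significance level.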

2.3. Clustering of individuals

Schwarz' and Kashyap's criteria could be calculated for the
problem of clustering individuals according to Wolfe's (1970)
mixture-model clustering approach and incorporated into computer
programs for clustering. The values of the criteria can be used
heuristically as figures of merit for alternative models, but in order
to be rigorously applied the model-selection criteria need to be
modified, since their derivation involves an assumption of
nonsingularity of the information matrix. However, note in this
regard a potential advantage of model-selection criteria over a
hypothesis-testing approach in this and similar situations.
Model-selection criteria require nonsingularity of the information
matrix only for each fixed model k. The testing approach runs
into difficulties because of singularity of the matrix at the
boundary between the null and alternative hypotheses (i.e., at the
boundary between models).

2.4.  Clustering  of  variables 

The clustering of variables can also be viewed as a
model-selection problem. For example, whether and how to cluster
multinormal variables depends upon which covariances may be assumed to
be zero; the possible patterns of zeros among the covariances are
separate models, a figure of merit for which is provided by a suitable
model-selection criterion. This idea is to be further developed.

3.  Time-series  segmentation 

As mentioned above, a model for clustering or segmentation is
given by assuming that each instance of observation, t, gives rise not
only to an observation x_t but also to a label, g_t, equal to 1, 2,
..., or k, where k is the number of classes of segment.
Model-selection criteria are used to estimate k. In the context of
this model, segmentation is merely estimation of the labels. Sclove
(1983b,c; 1984a) treats the problem by modeling the label process as
a Markov chain. An algorithm and computer programs are discussed;
numerical examples are given.

The  model  involves  three  sets  of  parameters:  the  distributional 
parameters  (e.g.,  means  and  covariance  matrices),  the  labels,  and  the 
transition  probabilities  between  labels. 

The algorithm is a relaxation method, similar to the EM algorithm.
The estimation step consists of maximum-likelihood estimation of the
distributional parameters, for tentatively fixed values of the labels
and transition probabilities. The maximization step consists of
maximizing the likelihood over the labels and transition probabilities,
for tentatively fixed values of the distributional parameters.
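
The alternation between the two steps can be sketched as follows. This
is a simplified Python illustration, not the project's FORTRAN program:
it assumes univariate normal classes with a known common variance,
labels chosen one at a time in a forward pass, transition probabilities
estimated by relative frequencies with a small floor to keep them
positive, and hypothetical starting means. All names are illustrative.

```python
import math


def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def forward_labels(xs, means, var, trans):
    """Classify x_1, then each x_t given the label already chosen for x_(t-1)."""
    k = len(means)
    labels = [max(range(k), key=lambda j: log_normal_pdf(xs[0], means[j], var))]
    for x in xs[1:]:
        prev = labels[-1]
        labels.append(max(range(k), key=lambda j: math.log(trans[prev][j])
                          + log_normal_pdf(x, means[j], var)))
    return labels


def transition_estimates(labels, k, floor=1e-3):
    """Relative frequencies of label transitions, floored to stay positive."""
    counts = [[floor] * k for _ in range(k)]
    for i, j in zip(labels, labels[1:]):
        counts[i][j] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def class_means(xs, labels, old_means):
    """Mean of the observations currently labeled as each class."""
    means = []
    for j, old in enumerate(old_means):
        members = [x for x, g in zip(xs, labels) if g == j]
        means.append(sum(members) / len(members) if members else old)
    return means


def segment(xs, means, var=1.0, n_iter=10):
    """Alternate label/transition updates and mean updates until labels settle."""
    k = len(means)
    trans = [[1.0 / k] * k for _ in range(k)]
    labels = None
    for _ in range(n_iter):
        new_labels = forward_labels(xs, means, var, trans)
        if new_labels == labels:
            break
        labels = new_labels
        trans = transition_estimates(labels, k)
        means = class_means(xs, labels, means)
    return labels, means, trans


xs = [0.1, -0.3, 0.2, 0.0, 4.9, 5.2, 5.0, 4.8]  # hypothetical series
labels, means, trans = segment(xs, means=[1.0, 4.0])
```

The loop stops when a pass leaves the labels unchanged, which is one
simple stopping rule among several that could be used.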

As developed so far, the algorithm is a forward algorithm,
classifying x_2 after x_1, x_3 after x_2 and x_1, etc. It is
suitable for sequential operation in real time, but it is non-optimal
in other modes of operation. Its performance could possibly be
improved by a backcasting technique analogous to that in Box and
Jenkins (1976) and by application of the Viterbi algorithm (Forney
1973), which is a recursive optimal solution to the problem of
estimating the state sequence of a discrete-time finite-state
Markov process; it is applicable here because this is what we have
at each stage when the distributional parameters and transition
probabilities are tentatively fixed and the labels are to be estimated.
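
For fixed parameters, a Viterbi pass over the labels can be sketched as
below. The sketch assumes univariate normal class-conditional densities
and works with logarithms; the function names, the initial label
probabilities and the numerical values are hypothetical.

```python
import math


def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_labels(xs, means, variances, trans, init):
    """Most probable label sequence for fixed parameters (log domain)."""
    k = len(means)
    score = [math.log(init[j]) + log_normal_pdf(xs[0], means[j], variances[j])
             for j in range(k)]
    back = []
    for x in xs[1:]:
        prev, score, ptr = score, [], []
        for j in range(k):
            # Best predecessor label for label j at this instant.
            i_best = max(range(k), key=lambda i: prev[i] + math.log(trans[i][j]))
            ptr.append(i_best)
            score.append(prev[i_best] + math.log(trans[i_best][j])
                         + log_normal_pdf(x, means[j], variances[j]))
        back.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(k), key=lambda j: score[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


xs = [0.0, 0.2, 2.6, 0.1, 5.0, 5.1]  # hypothetical series
path = viterbi_labels(xs, means=[0.0, 5.0], variances=[1.0, 1.0],
                      trans=[[0.9, 0.1], [0.1, 0.9]], init=[0.5, 0.5])
```

In this made-up series the observation 2.6 is slightly closer to the
second mean, but the whole-sequence solution keeps it in the first
segment because two extra label changes would cost more than is gained.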

Further, the parameter-estimation step of the algorithm can be
improved. The estimation implemented in the existing algorithm leads
to estimates that are biased (even asymptotically). (See, e.g., Bryant
and Williamson 1978.) This bias may be viewed as due to the
truncation resulting from the algorithm. The estimation could be
modified by doing it in a Bayesian manner, e.g., estimating the mean of
Class A as

sum_{t=1}^{n} x_t Pr(a|x_t) / sum_{t=1}^{n} Pr(a|x_t).

(In this expression, Pr(a|x_t) = Pr(x_t|a) Pr(a) / f(x_t); the factor
Pr(a) is common to numerator and denominator and cancels, but f(x_t)
varies with t and does not.) This modification in the
parameter-estimation step can be important, for in this estimate
all the observations play a role, whether labeled as "Class A" or
otherwise, so that at least some of the bias incurred by using only
the "a" observations will be removed by allowing all of the
observations to enter.
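
A small sketch of this weighted estimate, set beside the estimate that
uses only the observations labeled as Class A. It assumes two univariate
normal classes with known variances and prior probabilities; those
values, the data and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior_weights(xs, a, means, variances, priors):
    """Pr(a | x_t) for each observation, by Bayes' rule."""
    weights = []
    for x in xs:
        joint = [p * normal_pdf(x, m, v)
                 for p, m, v in zip(priors, means, variances)]
        weights.append(joint[a] / sum(joint))
    return weights


def weighted_mean(xs, weights):
    """sum_t x_t Pr(a | x_t) / sum_t Pr(a | x_t): every observation enters."""
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


def hard_mean(xs, weights):
    """Mean of only those observations assigned to class a (weight > 1/2)."""
    members = [x for x, w in zip(xs, weights) if w > 0.5]
    return sum(members) / len(members)


xs = [-1.0, 0.0, 1.0, 2.0, 4.0, 5.0, 6.0]  # hypothetical data
w = posterior_weights(xs, a=0, means=[0.0, 5.0],
                      variances=[1.0, 1.0], priors=[0.5, 0.5])
soft = weighted_mean(xs, w)
hard = hard_mean(xs, w)
```

Here the hard-assignment mean is 0.5, while the weighted estimate is
about 0.47, since observations near the class boundary contribute only
in proportion to their probability of belonging to the class.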

The  work  done  to  date  is  explicit  only  for  the  case  in  which  the 
class-conditional processes consist of independent, identically
distributed  random  variables.  The  work  is  to  be  extended  to  other, 
often  more  realistic  cases,  such  as  that  of  autoregression  within 
segments. 


4. Image segmentation

Similar ideas are applied to digital images in Sclove
(1983a; 1984a). Here the label process is modeled as a Markov random
field. The same improvements made in the time-series context will be
carried over to the two-dimensional, image-processing context. For
example, computer experiments (Sclove 1984b) with the existing
algorithm  have  shown  it  to  be  successful,  even  in  finding  small 
targets.  However,  at  the  same  time,  these  experiments  have  shown  the 
importance  of  some  such  modification  as  backcasting,  as  mentioned  in 
connection  with  time  series,  to  eliminate  anomalous  border  effects. 

Extension of the existing work to two-dimensional autoregressions
within segments will yield algorithms that may detect textures.